EFFICIENT REAL-TIME ANALYSIS METHOD OF ERROR LOGS FOR AUTONOMOUS SYSTEMS



BACKGROUND OF THE INVENTION 

Field of the Invention 

[0001] This invention is related in general to the field of data storage systems. In
particular, the invention consists of a pattern analysis method used to dynamically detect
errors and generate weighted numeric values.

Description of the Prior Art 

[0002] Error logs are generated by systems such as mechanical systems, computer
systems, and information systems in response to system faults or anomalous conditions.
These systems often include an error logging and analysis component ("ELA") to log the
error, analyze the failure, and initiate mitigating action in real-time. Systems that
experience repetitive errors may utilize analysis techniques to recognize error patterns.

[0003] Data storage systems such as computer hard disk drives, redundant arrays of
independent/inexpensive disks ("RAIDs"), or structured random access memory
("RAM") can benefit from error pattern analysis to determine the source of repetitive
errors or to predict system failure. However, error pattern analysis traditionally has been
difficult to implement in complex systems. Real-time pattern analysis has generally been
limited by space (required to store error messages), processing resources, and the amount
of time required to detect and analyze error patterns.

[0004] Newer ELA components utilize time-based methods to determine if a fault is
statistically relevant. A common technique is to sum the number of fault events of a
particular type over a time interval and compare this to a predetermined threshold. These
time-based methods are relatively simple and effective in overcoming the problems of
storage space, processing resources, and time. However, time-based methods are not
efficient when used in complex software/hardware systems because they do not
effectively detect problems that develop over long periods of time. This can potentially
result in an unexpected loss of a resource or catastrophic system failure. Additionally,
time-based ELA systems have difficulty managing errors that occur in clusters, i.e., large
numbers of errors over a short period of time interspersed with long error-free periods.
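
The time-based technique described in this paragraph can be sketched as follows. This is a minimal illustration only, not code from any existing ELA component; the class name `WindowedFaultCounter`, the 60-second window, and the threshold of 3 are assumptions chosen for the example. The sketch also illustrates the weakness noted above: faults that recur slowly never accumulate inside a single window and so never reach the threshold.

```python
from collections import defaultdict, deque


class WindowedFaultCounter:
    """Counts faults of each type inside a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = defaultdict(deque)  # fault type -> timestamps still in window

    def record(self, fault_type, timestamp):
        """Record a fault; return True if the windowed count exceeds the threshold."""
        events = self._events[fault_type]
        events.append(timestamp)
        # Drop events that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.threshold


counter = WindowedFaultCounter(window_seconds=60, threshold=3)

# A burst: five faults in five seconds trips the threshold on the fourth fault.
burst = [counter.record("disk_timeout", t) for t in range(5)]

# A slow trend: one fault every two minutes never trips it, however long it lasts.
slow = [counter.record("fan_degraded", t * 120) for t in range(50)]
```
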

[0005] In U.S. Patent No. 5,463,768, Paul Cuddihy et al. disclose an error log analysis
system comprising a diagnostic unit and a training unit wherein the training unit includes
a plurality of historical error logs. Sections of error logs that are in common with other
historical error logs are identified and labeled as blocks. Each block is then weighted with
a numerical value that is indicative of its value in diagnosing a fault. However, this
system does not assign error weights to individual error instances. Additionally, proper
implementation of this system requires that error analysis be order or time dependent.

[0006] In U.S. Patent No. 6,625,589, Anil Varma et al. disclose an algorithm for
improving the probability of identifying a repair that will correct a fault utilizing a
historical fault log and calculating the number of times a fault occurs in a given period of
time. Faults which occur with a frequency greater than the average are considered
statistically significant. However, weights are not assigned to individual errors to assist
the fault analysis process. Accordingly, it would be advantageous to have an error
logging system that utilizes error severity and occurrence to generate a weighted error
rate. Additionally, it would be beneficial to compare these weighted error rates to a
predetermined threshold to assist in predicting component failure.

SUMMARY OF THE INVENTION



[0007] The invention disclosed herein is an error logging system that utilizes a
weighted frequency-based approach to error analysis. The weight of common types of
errors is computed in real time from the frequency and severity of these errors.

[0008] One aspect of this invention is the assignment of an initial error severity weight
to an initial occurrence of a fault. When a like error subsequently occurs, the time period
between the two faults is used to generate an error frequency factor. This error frequency
factor is added to the initial error severity weight to produce a weighted error rate. When
additional like errors occur, new weighted error rates are produced by summing the initial
error severity weight, the new error frequency factor, and a percentage of the prior
weighted error rate. The resulting weighted error rate is then compared to a
predetermined threshold to determine if the fault is statistically significant.

[0009] Various other purposes and advantages of the invention will become clear from
its description in the specification that follows and from the novel features particularly
pointed out in the appended claims. Therefore, to the accomplishment of the objectives
described above, this invention comprises the features hereinafter illustrated in the
drawings, fully described in the detailed description of the preferred embodiments and
particularly pointed out in the claims. However, such drawings and description disclose
just a few of the various ways in which the invention may be practiced.

BRIEF DESCRIPTION OF THE DRAWINGS



[0010] Figure 1 is a schematic diagram of an error logging and analysis device in 
accordance with the invention including an error detection unit, an error logging device, 
and an error analysis unit. 

[0011] Figure 2 is a schematic diagram of one embodiment of the error logging and
analysis device of Figure 1, wherein the error logging device is a digital storage device
and the analysis unit is a computer processor.

[0012] Figure 3 is a flow-chart illustrating the process of detecting fault conditions,
storing information related to the fault conditions, analyzing the severity and frequency
of like fault conditions, assigning a weighted error rate to like fault conditions, comparing
the resulting weighted error rate to a predetermined threshold, and ascertaining if the
common fault conditions are statistically significant.



IBM Docket No. TUC920040006US1 




DESCRIPTION OF THE PREFERRED EMBODIMENTS 



[0013] This invention is based on the idea of using an error logging and analysis
("ELA") device to detect errors, assign weighted error rates to these errors, and compare
the weighted error rates to predetermined thresholds. Referring to the figures, wherein
like parts are designated with the same reference numerals and symbols, Fig. 1 is a
schematic illustration of an ELA system 10 including an error detection unit 12, an error
logging device 14, and an error analysis unit 16. The ELA system 10 may be
implemented in almost any system using real-time logging and analysis such as
mechanical systems, information systems, and computer systems.

[0014] The invention disclosed herein may be implemented as a method, apparatus or
article of manufacture using standard programming or engineering techniques to produce
software, firmware, hardware, or any combination thereof. The term "article of
manufacture" as used herein refers to code or logic implemented in hardware or computer
readable media such as optical storage devices, and volatile or non-volatile memory
devices. Such hardware may include, but is not limited to, field programmable gate arrays
(FPGAs), application-specific integrated circuits (ASICs), complex programmable logic
devices (CPLDs), programmable logic arrays (PLAs), microprocessors, or other similar
processing devices.

[0015] Optical storage devices may include compact-disk read-only memory devices
(CD-ROMs) or other types of optical disks. Volatile and non-volatile memory devices
include programmable read-only memory (PROM), erasable programmable read-only
memory (EPROM), electrically-erasable programmable read-only memory (EEPROM),
random-access memory (RAM), static random-access memory (SRAM), dynamic random-access
memory (DRAM), magnetic disk drives, tape cartridges, and other types of data storage
devices.

[0016] Algorithmic instructions that are placed into computer readable media are 
retrieved and implemented by the processing device. These algorithmic instructions may 
be accessed through any transmission media that can accommodate the transmission and 
reception of digital data such as local area networks (LANs), wide area networks 
(WANs), wireless networks, or the Internet. Those skilled in the art will recognize that
modifications may be made to the configurations set forth below without departing from 
the scope of the present invention, and that the article of manufacture may comprise any 
medium capable of storing digital information. 

[0017] One embodiment of the invention is illustrated in the schematic drawing of
Fig. 2. In a computer system 20, an ELA system 10 is connected to a computer network
22. A processing device 24 acts as both the error detection unit 12 and the error analysis
unit 16. This processing device 24 may be a field-programmable gate array ("FPGA"), a
complex programmable-logic device ("CPLD"), an application-specific integrated circuit
("ASIC"), a general purpose processor ("CPU"), a micro-processor, or other similar
computer processing device. A memory device 26, such as a random access memory
("RAM") integrated circuit, is used to store information about error conditions.

[0018] Fig. 3 is a flow-chart illustrating the process of analyzing fault conditions,
utilizing an error logging and analysis algorithm 28. This algorithm may be implemented
as either a hardware construct or a software application. In step 30, the processing device
24 monitors the computer network 22, actively listening for error messages. When a new
type of error message is detected, the processing device 24 assigns an initial severity
weight ("ISW") to the error message in step 32. In this embodiment of the invention, this
initial severity weight is proportional to the potential impact this type of error may have
in creating a failure of the computer system 20. The initial severity weight and the time
the error condition occurred are recorded to the memory device 26 in step 34.
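
Steps 32 and 34 might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the severity table, its numeric values, the fallback weight, and the names `ErrorRecord` and `log_first_occurrence` are all hypothetical. The text specifies only that the weight is proportional to the error type's potential to cause a system failure and that the weight and time of occurrence are stored.

```python
from dataclasses import dataclass

# Hypothetical severity table; the values are illustrative assumptions.
INITIAL_SEVERITY_WEIGHTS = {
    "media_error": 5.0,
    "link_reset": 2.0,
    "corrected_ecc": 0.5,
}
DEFAULT_SEVERITY_WEIGHT = 1.0  # assumed fallback for unlisted error types


@dataclass
class ErrorRecord:
    error_type: str
    initial_severity_weight: float  # ISW, assigned in step 32
    last_seen: float                # time of occurrence, stored in step 34


def log_first_occurrence(error_log, error_type, timestamp):
    """Steps 32-34: assign an ISW to a new error type and record it with its time."""
    if error_type in error_log:
        return error_log[error_type]  # not a new type; nothing to assign
    isw = INITIAL_SEVERITY_WEIGHTS.get(error_type, DEFAULT_SEVERITY_WEIGHT)
    record = ErrorRecord(error_type, isw, timestamp)
    error_log[error_type] = record
    return record


error_log = {}  # a plain dict standing in for the memory device 26
first = log_first_occurrence(error_log, "media_error", timestamp=100.0)
unknown = log_first_occurrence(error_log, "vendor_specific", timestamp=105.0)
```
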

[0019] In step 36, when a subsequent error of a like type is detected, the processing
device 24 determines the time interval between the initial error and the new error. Based
on this time interval, the processing device 24 calculates an error frequency factor
("EFF") in step 38. In this embodiment of the invention, this error frequency factor is
inversely proportional to a predetermined base number representative of a period of time
such as a minute, a day, a month, or a year. Accordingly, the smaller the intervening time
period, the greater the error frequency factor. This error frequency factor is added to the
initial error severity weight to generate a weighted error rate ("WER") in step 40:

WER = ISW + EFF.
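
The text does not give a closed-form expression for the error frequency factor, so the sketch below assumes one possible reading: the EFF is the reciprocal of the interval between like errors when that interval is measured in units of the base period, i.e. EFF = base period / interval. That form, the one-day base period, the function names, and the numeric values are assumptions made for illustration. It preserves the stated property that a shorter interval yields a larger factor.

```python
SECONDS_PER_DAY = 86_400.0  # assumed base period; a minute, month, or year is equally valid


def error_frequency_factor(interval_seconds, base_period=SECONDS_PER_DAY):
    """EFF under an assumed form: reciprocal of the interval measured in base periods."""
    if interval_seconds <= 0:
        raise ValueError("interval between like errors must be positive")
    return base_period / interval_seconds


def weighted_error_rate(isw, interval_seconds, base_period=SECONDS_PER_DAY):
    """Step 40 for the second occurrence of an error type: WER = ISW + EFF."""
    return isw + error_frequency_factor(interval_seconds, base_period)


# Two like errors one day apart versus six hours apart, with an illustrative ISW of 2.0.
wer_daily = weighted_error_rate(2.0, 86_400.0)  # EFF = 1.0
wer_6h = weighted_error_rate(2.0, 21_600.0)     # EFF = 4.0
```

Under these assumptions the same error type scores higher when its second occurrence follows the first more closely, which is the behavior the paragraph describes.
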



[0020] In step 42, the weighted error rate is compared to a predetermined threshold. If
the weighted error rate exceeds the predetermined threshold, the processing device labels
the current error message as statistically significant and requests remedial action in step
44. This request for remedial action may be in the form of an alert message displayed on
a computer screen, an email sent to a user, or a text document sent to a printer. When
subsequent like types of error messages are detected, the process returns to step 36.
For these subsequent errors, step 40 is modified to include a trend factor ("TF") that
carries forward a percentage of the previous weighted error rate. For example,

New WER = ISW + EFF + (TF × Old WER).
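
The complete loop of steps 36 through 44 might look like the following sketch. Several details are assumptions rather than statements from the text: the EFF is taken as base period divided by the interval, the interval is measured between consecutive like errors, and the class name, trend factor of 0.5, and threshold of 10.0 are illustrative. Starting the stored WER at zero makes the second occurrence reduce to WER = ISW + EFF, matching step 40 as first described.

```python
class WeightedErrorTracker:
    """Tracks the weighted error rate of one error type (illustrative names)."""

    def __init__(self, isw, trend_factor, threshold, base_period=86_400.0):
        self.isw = isw                    # initial severity weight
        self.trend_factor = trend_factor  # TF: fraction of the old WER carried forward
        self.threshold = threshold        # predetermined threshold of step 42
        self.base_period = base_period    # assumed base period, in seconds
        self.last_seen = None
        self.wer = 0.0

    def observe(self, timestamp):
        """Process one occurrence; return True if remedial action should be requested."""
        if self.last_seen is None:
            # First occurrence: only the ISW and the time are recorded.
            self.last_seen = timestamp
            return False
        interval = timestamp - self.last_seen
        if interval <= 0:
            raise ValueError("timestamps must be strictly increasing")
        eff = self.base_period / interval  # assumed form of the EFF
        # New WER = ISW + EFF + (TF x Old WER)
        self.wer = self.isw + eff + self.trend_factor * self.wer
        self.last_seen = timestamp
        return self.wer > self.threshold


tracker = WeightedErrorTracker(isw=2.0, trend_factor=0.5, threshold=10.0)

# Like errors arriving at shrinking intervals: 24 h, 12 h, 6 h, then 3 h apart.
times = [0.0, 86_400.0, 129_600.0, 151_200.0, 162_000.0]
alerts = [tracker.observe(t) for t in times]
```

In this run the WER climbs through 3.0, 5.5, and 8.75 before reaching 14.375 on the fifth occurrence, the first value to exceed the threshold. Because the trend factor retains part of the earlier rate, an accelerating pattern is flagged even though no single interval on its own would push ISW + EFF past the threshold until the last one.
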

[0021] Those skilled in the art of making error analysis and logging systems may
develop other embodiments of the present invention. For example, separate processing
devices may be used as the error detection unit and the error analysis unit. Additionally,
the invention can be implemented with a processing device containing a memory device
utilized for error logging.

[0022] The terms and expressions which have been employed in the foregoing
specification are used herein as terms of description and not of limitation, and there is no
intention in the use of such terms and expressions of excluding equivalents of the features
shown and described or portions thereof, it being recognized that the scope of the
invention is defined and limited only by the claims which follow. Other embodiments of
the invention may be implemented by those skilled in the art of error detection.